# Exploratory Data Analysis: White Wine Quality

This report explores a dataset containing attributes for 4898 instances of the Portuguese “Vinho Verde” white wine.

The attributes are the following:

  1. fixed acidity (tartaric acid - g / dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily).
  2. volatile acidity (acetic acid - g / dm^3): the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste.
  3. citric acid (g / dm^3): found in small quantities, citric acid can add ‘freshness’ and flavor to wines.
  4. residual sugar (g / dm^3): the amount of sugar remaining after fermentation stops. It’s rare to find wines with less than 1 g / dm^3 and wines with more than 45 g / dm^3 are considered sweet.
  5. chlorides (sodium chloride - g / dm^3): the amount of salt in the wine.
  6. free sulfur dioxide (mg / dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion. It prevents microbial growth and the oxidation of wine.
  7. total sulfur dioxide (mg / dm^3): amount of free and bound forms of SO2. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
  8. density (g / cm^3): the density of wine is close to that of water depending on the percent alcohol and sugar content.
  9. pH - describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
  10. sulphates (potassium sulphate - g / dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
  11. alcohol (% by volume) - the percent alcohol content of the wine.
  12. quality: score between 0 and 10 (based on sensory data).

The structure of the data:

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Univariate Plots Section

## [1] 4898   13

Our dataset contains 4898 observations and 13 variables. The structure shows that all of the variables are numeric; X appears to be simply a row index.

## [1] "3" "4" "5" "6" "7" "8" "9"

Here I converted the numerical variable quality to a factor variable.
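The conversion can be sketched in base R as follows. This is a minimal illustration and not necessarily the exact call used in the report: `wine_sample` is a hypothetical toy data frame with made-up ratings, standing in for the full data set.

```r
# Toy stand-in for the wine data frame (ratings are illustrative only).
wine_sample <- data.frame(quality = c(6L, 5L, 7L, 6L, 3L, 9L, 4L, 8L))

# Convert the integer scores to a factor so each rating is treated as a category.
wine_sample$quality <- factor(wine_sample$quality)

# The levels are the distinct ratings, sorted in increasing order.
print(levels(wine_sample$quality))
```

If the ordering of the ratings matters for later steps, `factor(..., ordered = TRUE)` would be an alternative worth considering.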

##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

The distribution of quality appears roughly normal, and the median and the mean are almost the same. There are more than 2000 wines with a rating of 6. Since quality is rated between 0 (very bad) and 10 (excellent), most of the wines score above the midpoint of the scale.
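The mean, the median, and the share of wines rated above 5 can be recomputed directly from the counts printed above. The sketch below rebuilds the vector of scores from that table; the variable names are my own.

```r
# Counts per quality rating, copied from the table above.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)

# Expand the counts back into one score per wine.
scores <- rep(as.integer(names(counts)), times = counts)

mean_quality   <- mean(scores)      # about 5.88
median_quality <- median(scores)    # 6
share_above_5  <- mean(scores > 5)  # about 0.665

print(c(mean = mean_quality, median = median_quality, above_5 = share_above_5))
```

Roughly two thirds of the wines are rated 6 or higher, which is what the bar chart suggests.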

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Count of wines by rating. We can clearly see that there are more than 2000 wines rated 6.

Here we have the distribution of the percent alcohol content of the wine. It appears to be slightly skewed, peaking at around 9.5%.

Sulphates appear to be approximately normally distributed. Sulphates are wine additives that can contribute to sulfur dioxide gas levels, which act as an antimicrobial and antioxidant.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic). Most of the wines are between 3 and 3.5 on the pH scale.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density appears to be normally distributed across white wines.

Total sulfur dioxide represents the amount of free and bound forms of sulfur dioxide gas. In low concentrations, sulfur dioxide is mostly undetectable in wine, but at free concentrations over 50 ppm (mg/liter), sulfur dioxide becomes evident in the nose and taste of wine. Free sulfur dioxide prevents microbial growth and the oxidation of wine.

This histogram shows the amount of salt in white wines. The majority of wines have less than 0.1 gram/liter.

# Five-number summary (plus the mean) of residual sugar
summary(df$residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

This plot shows the distribution of the amount of sugar remaining after fermentation stops. Wines with greater than 45 grams/liter are considered sweet. About three quarters of our wines have less than 10 grams/liter (the third quartile is 9.9).
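The summary values for residual sugar also hint at how long the right tail is. As a rough check, the sketch below applies Tukey's 1.5 × IQR rule to the printed quartiles. That rule is my own choice for illustration and is not part of the original analysis.

```r
# Summary statistics for residual sugar, copied from the output above.
sugar <- c(min = 0.6, q1 = 1.7, median = 5.2, mean = 6.391, q3 = 9.9, max = 65.8)

# Interquartile range and the conventional upper fence for flagging outliers.
iqr         <- sugar[["q3"]] - sugar[["q1"]]    # 8.2
upper_fence <- sugar[["q3"]] + 1.5 * iqr        # 22.2

# A mean above the median is a common sign of right skew.
right_skewed <- sugar[["mean"]] > sugar[["median"]]

print(c(iqr = iqr, upper_fence = upper_fence, max = sugar[["max"]]))
```

The maximum of 65.8 g/L lies far beyond the fence of about 22.2 g/L, so at least one wine is an extreme value on this measure.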

Citric acid can add “freshness” and flavor to wines. Most of our wines have less than 0.5 grams/liter of citric acid.

The volatile acidity histogram shows the distribution of the amount of acetic acid in wines, which at too high levels can lead to an unpleasant, vinegar taste. The peak is at around 0.25.

The fixed acidity histogram shows the tartaric acid of wines.

Univariate Analysis

What is the structure of your dataset?

There are 4898 white wine observations and 13 variables: the row index X plus 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality).

What is/are the main feature(s) of interest in your dataset?

The main features in our data set are quality, alcohol %, pH, residual sugar, citric acid and volatile acidity. I suspect that these, alone or in combination with other variables, determine the quality rating.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I believe that free sulfur dioxide and total sulfur dioxide, in relationship with the rest of the variables, could also contribute to my analysis of quality.

Bivariate Plots Section

There is a moderate positive correlation between alcohol and quality. We can see that as the alcohol increases the rating slightly increases as well.

There is no meaningful relationship between sulphates and quality. This wine additive appears to have little impact on quality.

There is no meaningful relationship between pH and quality.

We have a small negative relationship between quality and density. As the density increases, the quality decreases.

From the plot above we see that there is a small negative relationship between total sulfur dioxide and quality. This means that as the total sulfur dioxide increases the quality decreases.

There is no meaningful correlation between free sulfur dioxide and quality.

[Plot: chlorides (sodium chloride - g / dm^3) by quality]

There is a small negative correlation between chlorides and quality. If the amount of salt increases the quality decreases.

There is no meaningful relationship between residual sugar and quality.

There is no meaningful relationship between citric acid and quality.

There is a small negative correlation between volatile acidity and quality.

There is no clear correlation between fixed acidity and quality.

We observe a strong negative correlation between density and alcohol. As the percent of alcohol increases the density decreases.

We have a moderate negative relationship between alcohol and total sulfur dioxide.

The plot above shows a moderate negative correlation between fixed acidity and pH.

There is a strong positive correlation between density and residual sugar.
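The direction of the density relationships can be spot-checked with `cor()`. The sketch below uses only the first five rows as printed in the structure output at the top of the report (with density rounded as displayed), so the coefficients are illustrative and will differ from those of the full data set.

```r
# First five rows of three columns, copied from the str() output.
first_rows <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Pearson correlation coefficients (the default method of cor()).
r_density_alcohol <- cor(first_rows$density, first_rows$alcohol)
r_density_sugar   <- cor(first_rows$density, first_rows$residual.sugar)

print(c(density_alcohol = r_density_alcohol, density_sugar = r_density_sugar))
```

Even in this tiny sample the signs match the plots: density falls as alcohol rises and climbs with residual sugar.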

There is a moderate positive correlation between density and total sulfur dioxide.

We observe a moderate positive correlation between total sulfur dioxide and free sulfur dioxide.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Unexpectedly, most of the main features of interest that I listed above (pH, residual sugar and citric acid) have no meaningful relationship with quality. The exceptions are alcohol, which has a moderate positive correlation with quality, and volatile acidity, which has a small negative one.

I have found that there is a small negative correlation between density and quality and between total sulfur dioxide and quality. The amount of salt (chlorides) also has a small impact on quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I also found that there are some relationships between the features of wines. For example, I observed a strong negative correlation between density and alcohol. As the percent of alcohol increases the density decreases. There is also a moderate negative relationship between alcohol and total sulfur dioxide. There is a strong positive correlation between density and residual sugar.

What was the strongest relationship you found?

The strongest correlations I found are between other features. There is a strong positive correlation between residual sugar and density: as the amount of sugar increases, the density increases. Another strong relationship was observed between density and alcohol: as the percent of alcohol increases, the density decreases.

Multivariate Plots Section

In this plot we can see that the average alcohol percent is higher for the wines with higher quality rating.
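A per-rating average like the one plotted here can be computed with `tapply()`. The sketch below is a minimal illustration on a hypothetical toy data frame; the alcohol values are made up and are not taken from the wine data.

```r
# Hypothetical toy data: two wines per rating, values invented for illustration.
toy <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7),
  alcohol = c(9.4, 9.8, 10.2, 10.6, 11.2, 11.6)
)

# Mean alcohol content within each quality rating.
mean_alcohol <- tapply(toy$alcohol, toy$quality, mean)

print(mean_alcohol)
```

On the real data the same call would take the alcohol and quality columns of the full data frame.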

Here we have a histogram of alcohol faceted by quality. Alcohol looks approximately normally distributed for wines rated above 5. We can see here that wines rated above average peak around 11% alcohol.

In this plot we observe a strong negative relationship between alcohol and density, especially for the wines that are rated above 5 in quality.

We observe here a moderate negative relationship between alcohol and total sulfur dioxide, faceted by quality, especially for the wines that are rated above 5.

I omitted 1% of the residual sugar data here. There is a strong relationship between density and residual sugar for wines rated 5 and above. As the amount of sugar increases, the density increases.
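One plausible reading of omitting 1% of the residual sugar data is dropping the values above the 99th percentile. That reading is an assumption on my part, and the sketch below shows it on a hypothetical toy vector, not on the wine data.

```r
# Hypothetical toy vector with one extreme value, for illustration only.
x <- c(1:99, 1000)

# 99th percentile (default quantile type 7 in R).
cutoff <- quantile(x, probs = 0.99)

# Keep only the observations at or below the cutoff.
kept <- x[x <= cutoff]

print(c(cutoff = unname(cutoff), n_kept = length(kept), max_kept = max(kept)))
```

In a ggplot2 plot the same effect is often achieved by limiting the axis to the chosen quantile instead of filtering the data beforehand.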

This plot shows pH for each quality in relationship with fixed acidity.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

I observed that wines with a higher percent of alcohol are rated higher. I also observed that higher rated wines, which tend to have more alcohol, have a lower density.

Were there any interesting or surprising interactions between features?

Higher rated wines have lower total sulfur dioxide, which is consistent with sulfur dioxide being mostly undetectable at low concentrations.

Higher rated wines have a lower amount of sugar than the other rated categories.


Final Plots and Summary

Plot One

[Plot: Alcohol Distribution by Quality]

Description One

Here we have the distribution of the percent alcohol content of the wine. It appears to be slightly skewed with the alcohol peaking at around 9.5. However, if we look at the higher rated wines, we see that the peak is around 11. So, wines rated above average have a higher percent of alcohol.

Plot Two

Description Two

There is a small negative relationship between quality and the amount of salt in wine. This means that higher rated wines have less salt than lower rated wines. We can clearly see that as the mean amount of salt decreases, the quality increases.

Plot Three

Description Three

In this plot we see that most of the wines rated above average are less sweet than the rest of the wines.

Reflection

I found that the wines with a higher quality rating have a higher percent of alcohol, are less sweet and have less salt. I also found that other features have a meaningful impact on quality, such as density and the total amount of sulfur dioxide.

I expected that the main features like pH, residual sugar, citric acid and volatile acidity would have a direct impact on the quality of the wine, but it seems they have an impact only in combination with other features.

The main challenge I encountered was that the variables were not clearly explained, as they represent chemical properties. We need to make sure we understand them so that we can present them in a concise and clear manner and the audience knows what we are talking about.

On the other hand, I found a tidy data set which was easy to work with as I didn’t have to struggle with cleaning the data.

Overall I am content with the data set I analyzed and with the insights I found. I didn’t expect to find features that I wouldn’t have thought of contributing to the wines’ quality ratings.

For the future I would perform further analyses outside EDA in order to confirm the findings or find some insights that perhaps are not obvious at this stage.
